Bias- Variance Techniques for Monte Carlo Optimization: 
Cross-validation for the CE Method 
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1 Introduction 

In this paper, we examine the CE method in the broad context of Monte Carlo Optimization (MCO) [Er 



moliev and Norkin 1998 Robert and Casella 2004 and Parametric Learning (PL), a type of machine 



learning. A well-known overarching principle used to improve the performance of many PL algorithms is 



the bias-variance tradeoff Wolpert 1997 . This tradeoff has been used to improve PL algorithms rang- 



ing from Monte Carlo estimation of integrals Lepage 1978 , to linear estimation, to general statistical 
estimation Breiman 1996a|b . Moreover, as described by Wolpert and Rajnarayan 2007 , MCO is very 



closely related to PL. Owing to this similarity, the bias-variance tradeoff affects MCO performance, just 
as it does PL performance. 

In this article, we exploit the bias-variance tradeoff to enhance the performance of MCO algorithms. 
We use the technique of cross-validation, a technique based on the bias- variance tradeoff, to significantly 
improve the performance of the Cross Entropy (CE) method, which is an MCO algorithm. In previous 
work we have confirmed that other PL techniques improve the perfomance of other MCO algorithms 
[see Wolpert and Rajnarayan 2007 . We conclude that the many techniques pioneered in PL could be 



investigated as ways to improve MCO algorithms in general, and the CE method in particular. 

The rest of the paper is organized as follows. In Sec. [2] we present an overview of the bias-variance 
tradeoff. In Sec. [3] we describe a few ways to exploit this tradeoff, starting from the relatively simple 
case of Monte Carlo integration, and proceeding to the more complex case of MCO. We also describe 
the original exploitation of this tradeoff, as a way to improve PL algorithms. In Sec. [4] we describe how 
to use cross-validation, a particular technique based on the bias-variance tradeoff, to modify the CE 
method. Sec. [5] then presents performance comparisons between this modified version of the CE method 
and the conventional CE method. These comparisons are on continuous, multimodal, unconstrained 
optimziation problems. We show that on these problems, using the modified version of the CE method 
can significantly improve optimization performance of the CE method, and never worsens performance. 
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2 The Bias- Variance Tradeoff 

Consider a given random variable and a random variable that we can modify, <i>. We wish to use 
a sample <fi of $ as an estimate of a sample <f> of The mean squared error between such a pair of 
samples is a sum of four terms. The first term reflects the statistical coupling between $ and $ and 
is conventionally ignored in bias-variance analysis. The second term reflects the inherent randomness 
in $ and is independent of the estimator <&. Accordingly, we cannot affect this term. In contrast, the 
third and fourth terms depend on The third term, called the bias, is independent of the precise 
samples of both $ and <&, and reflects the difference between the means of $ and <f>. The fourth term, 
called the variance, is independent of the precise sample of and reflects the inherent randomness in 
the estimator as one samples it. These last two terms can be modified by changing the choice of the 
estimator. In particular, on small sample sets, we can often decrease mean squared error by introducing 
a small bias that causes a large reduction the variance. While most commonly used in machine learning, 
bias- variance tradeoffs are applicable in a much broader context and in a variety of situations. 



2.1 A Simple Derivation of the Bias- Variance Decomposition 

Suppose we have a Euclidean random variable $ taking on values <p distributed according to a density 
function p((f>). We want to estimate a certain value <fi, that we cannot access directly, and that was 
obtained by sampling p(</>). We can, however, access a different Euclidean random variable <i> taking on 
values cj) distributed according to p(4>). We can also modify the distribution of and we want to exploit 
the coupling between $ and $ to improve our estimate. Assuming a quadratic loss function, the quality 
of our estimate is measured by its Mean Squared Error (MSE): 

MSE($) = J p{^cj)){^(t)) 2 d^d(t). (1) 

In standard bias-variance analysis, the statistical coupling of $ and $ is simply ignored without any 
justification^] and the distribution p((f>,(j)) is replaced with the product of marginals, p(<p)p((f>) . So our 
equation for MSE reduces to 

MSE($) = [ p(j>)p((j))(j)-(i)) 2 d$d(j). (2) 



Using simple algebra, the right hand side of Eq. [2] can be written as the sum of three terms. The 
first is the variance of This term is justifiably ignored in bias-variance analysis since it is beyond our 
control in designing the estimator The second term involves a mean that describes the deterministic 
component of the error. This term depends on both the distribution of $ and that of <!>, and quantifies 
how close the means of those distributions are. The third term is a variance that describes stochastic 
variations from one sample to the next. This term is independent of the random variable being estimated. 



1 One could account for this coupling by using an additive correction term [see Wolpert 1997 
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Formally, up to an overall additive constant, we can write 

MSE($) = y"p(0)(0 2 -2^+0 2 )4, 

= J P (0)0 2 4 -2<P J p(4>)$ 4 + 4>\ 

= V(0) + [E(0)] 2 -20E(0) + 2 , 
= V(0) + [0-E(0)] 2 , 

S v ' 

= variance + bias 2 . (3) 

In light of Eq. [3j one way to try to reduce MSE is to modify an estimator to trade bias for variance. 
Some of the most well-known applications of such bias-variance tradeoffs occur in parametric machine 
learning, where many techniques have been developed to exploit that tradeoff. There are still some 
extensions of that tradeoff that could be applied in parametric machine learning that have been ignored 
by that community. 



3 Applications of the Bias- Variance Tradeoff 

In this section, we describe some applications of the bias-variance tradeoff. First, we provide a sim- 
ple, concrete example that elaborates how conventional bias-variance techniques ignore the statistical 
coupling described in the previous section. Then, we describe the application of bias-variance tradeoffs 
to Monte Carlo (MC) techniques for the estimation of integrals, where, as for all unbiased estimators, 
the bias-variance tradeoff reduces to simple variance reduction. Next, we introduce the field of Monte 
Carlo Optimization (MCO), and illustrate that there are subtleties involved that are absent in simple 
MC. Then, we describe the field of Parametric Machine Learning, which is mathematically identical to 
MCO, and describe how they ignore some of the subtleties pertaining to MCO, but apply bias-variance 
analysis nonetheless. 

3.1 Supervised Learning 

Consider the simplest type of supervised machine learning problem, where the aim is to accurately 
predict the behavior of an input-output system, based on several 'training examples'. In this example, 
we consider a finite input space X, and an output space Y which is just the space of real numbers, 
and a deterministic input-output system, or 'target function' / that maps each element of X to a single 
element of Y . To be precise, there is a 'prior' probability density function p(f) over target functions, 
and it gets sampled to produce some particular target function, /. Next, / is IID sampled at a set of m 
inputs to produce a 'training set' T> of input-output pairs. 

For simplicity of analysis, say we have a single fixed 'prediction point' x € X. Our goal in supervised 
learning is to estimate f(x), but / is not known. Accordingly, to perform the estimation the training 
set is presented to a 'learning algorithm', which in response to the training set produces a guess f(x) 
for the value f(x). 
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This entire stochastic procedure defines a joint distribution 7r(/, T>, f(x), f(x)). In order to analyze 
the performance of the learning algorithm, we marginalize to obtain a distribution ir(f{x),f{x)). Since 
f(x) is supposed to be an estimate of f(x), we can identify f(x) as the value </> of the random variable 4> 
and f(x) as the value <j> of <I>. In other words, we can define p(4>, 4>) — ir(f(x),f(x)). If we then compute 
the mean squared error in the estimate made by our learning algorithm for the value f(x), we get Eq. [T| 

Of course, in general, <!> and i.e., f(x) and f(x), are statistically dependent; if they weren't, the 
learning algorithm gains nothing from knowing T>. This dependence can be established by writing 



p(f(x)J(x)) = / dV p(f(x),f(x) | V) p(V), 

dVp(f(x)\f(x),V) p(f(x)\V) p(V), 

dVp{f{x)\V)p{f{x)\V)p{V). (4) 



The first two steps comprise a straightforward application of Bayes' rule. In the third step, the con- 
ditioning on f(x) is removed because the guess of the learning algorithm is determined solely by the 
training set V. Nevertheless, note that Eq. [4] exact, and in general is not the same as the product 



p(f(x))p(f(x)) = 



dVp(f(x)\V)p(V) 



dVp(f(x)\V)p(V) 



(5) 



As we described in Sec. [2] conventional bias- variance analysis approximates Eq. [4] by Eq. [5] 
3.2 Monte Carlo Integration 

Monte Carlo methods are often the method of choice for estimating difficult high-dimensional integrals. 
Consider a function /: X — > M, which we want to integrate over some region X C X, yielding the value 
F, as given by 

F= f dxf(x). 



x 



We can view F as a degenerate random variable with density function given by a Dirac delta function 
centered on F. Therefore, the variance of $ is 0, and Eq. [3] is exact. 



A popular MC method to estimate this integral is importance sampling Robert and Casclla 2004 
This exploits the law of large numbers as follows: i.i.d. samples x^ l \ i = 1, . . . ,m are generated from 
a so-called importance distribution h(x) that we control, and the associated values of the integrand, 
f(x^) are computed. Denote these 'data' by 

2? = {(xW,/(a;W), z = l,...,m}. (6) 

Now, 



F = [ dxh(x)^ 
Jx h{x) 



x 



1 f(x^>) 
= lim — \ ... with probability 1. 
nwoo m h(x^>) 

i—l v 1 
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Denote by <!> the random variable with value given by the finite sample average for T>: 

1 

We use <& as our statistical estimator for <I>, as we broadly described in Sec. [l] Assuming a quadratic 
loss function, L(<fi,(f>) = (</> — 0) 2 , the bias-variance decomposition described in Eq. [i] applies exactly. It 
can be shown that the estimator <I> is unbiased, that is, E<& = cf>, where the mean is over samples of h. 
Consequently, the MSE of this estimator is just its variance. The choice of sampling distribution h that 
minimizes this variance is given by [see Robert and Casella} |2004 



S x \f{x<)\dx<- 



By itself, this result is not very helpful, since the equation for the optimal importance distribution 
contains a similar integral to the one we are trying to estimate. For non-negative integrands f(x), the 



VEGAS algorithm Lepage 1978 describes an adaptive method to find successively better importance 
distributions, by iteratively estimating <I>, and then using that estimate to generate the next importance 
distribution h. In the case of this and other unbiased estimators, there is no tradeoff between bias and 
variance, and minimizing MSE is achieved by minimizing variance. 

3.3 Monte Carlo Optimization 

Instead of a fixed integral to evaluate, consider a parametrized integral 

F(9)= [ dxf 9 (x). 

Further, suppose we are interested in finding the value of the parameter 9 G © that minimizes F{9): 

9* = argmini^). 

In the case where the functional form of fg is not explicitly known, one approach to solve this problem is a 



technique called Monte Carlo Optimization (MCO) [see Ermoliev and Norkin 1998 , involving repeated 
MC estimation of the integral in question with adaptive modification^] of the parameter 9. 

We proceed analogously to the preceding section. Whereas in MC, there was no parameter 9 and 
we had a single Dirac-dclta distribution, we now have a set of such distributions, corresponding to the 
values 9. Accordingly, we introduce the 0-indexed vector random variable each of whose components 
$e has a degenerate Dirac-delta distribution about the associated value F{9). Next, we introduce our 
estimator random- variable, a similar ^-indexed vector random variable $ each of whose components $g 
can be sampled to estimate $g. Regardless of how $ is defined, given a sample of (j>, one way to estimate 
9* is 



— arg mm < 

eee 



We call this approach 'natural' MCO. 

2 The similarity to the CE method is quite clear. 
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For example, let I? be a data set as described in Eq. [6] Then for every 9, any sample of T> provides 
a sample of which is the associated estimate 

fa = F{6) = ■ ^ MX ) 



Given this choice for the 'natural MCO' approach is to estimate the optimal 9 as follows. 

r = argminlvM^. (7) 
b «6em^ h(x^) 

As we shall see, this does not work well in practice, and we therefore call this importance-sampling 
application of natural MCO 'naive' MCO. 

In general, consider any algorithm that estimates 9* as a single- valued function of <j>. The estimate of 
9* produced by that algorithm is itself a random variable, since it is a function of the random variable <I>. 
Call this random variable ©*, taking on values 9*. Any MCO algorithm is defined by ©*; that random 
variable encapsulates the estimate made by the algorithm. 

To analyze the error of such an algorithm, consider the associated random variable F(Q*), taking on 
the exact values of the integral F(9*). Since our aim in MCO is to minimize F(9), it is reasonable to 
propose that the cost of estimating 9* by 9* is nothing but the difference between F(9*) and the true 
minimal value of the integral, F(9*) — mine F(9). In other words, we adopt the loss function 

L(9*,9*) = F{9*)-F(9*). (8) 

This is in contrast to our discussion on MC integration, which involved quadratic loss. Up to an unknown 
additive constant, this loss function is nothing but F(9*). This additive constant F(9*) is fixed by the 
MCO problem at hand, is beyond our control, and is not affected by our algorithm. Consequently, we 
ignore that additive constant, and write out the associated expected loss: 

E(L) = ( d9* P (9*)F(9*). (9) 



Now change coordinates in this integral from the values of the scalar random variable 9* to the values 
of the underlying vector random variable <i>. The expected loss now becomes 



E(L) = / d<t>p{<f>)F{9* {<!>)). 
The natural MCO algorithm provides some insight into these results. For that algorithm, 
E(L) = J d<t>p(<j>)F(&rgmm<t>o) 

j> Sl d(j)8 2 . . . p(0 9l ,0 e2 ,...)F(argmin0 e ). (10) 



For any fixed 9, there is an error between samples 4>g and the true value F(9). Bias- variance consid- 
erations apply to this error, exacty as in the discussion of MC above. In MCO, however, we are not 
concerned with for a single component 9, but rather for a set of 9's. 
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The simplest such case is where the components 4>g are independent. Even so, arg mingle is dis- 



tributed according to the laws for extrema of multiple independent random variables, and this distribu- 
tion depends on higher-order moments of each random variable <j>g. This means that E(L) also depends 
on such higher-order moments. Only the first two moments, however, arise in the bias and variance for 
any single 9. Thus, even in the simplest possible case, the bias-variance considerations for the individual 
9 do not provide a complete analysis. 

In most cases, the components of <j> are not independent. Therefore, in order to analyze E(L), in 
addition to higher moments of the distribution for each individual 8, we must now also consider higher- 
order moments coupling the estimates <pe for different 9. Conventional bias-variance analysis for MCO 
is therefore incomplete on three fronts: ignoring the coupling between $ and ignoring higher order 
moments for individual 0's, and ignoring moments coupling different 6>'s. 

Due to the coupling between 0's, it may be quite acceptable for the individual components (f>0 to have 
both a large bias and a large variance, as long as the covariances are large. Large covariances would 
ensure that if some (f>g were incorrectly large, then <pg>, for all 9' =/= 9 would also be incorrectly large. 
This would preserve the ordering of 0's under So, even with large bias and variance for each 9, the 
estimator as a whole would still work well. If we could exploit this insight, it may be possible to come 
up with weaker requirements for accurate estimators. Nevertheless, we can ignore these insights, and 
impose a stronger requirement: design estimators <j)g with sufficiently small bias plus variance for each 
single 9. More precisely, suppose that those terms are very small on the scale of differences F{9) — F{6') 
for any 9 and 9' . Then, by Chebychev's inequality, we know that the density functions of the random 
variables and <&gi have almost no overlap. Accordingly, the probability that a sample (f>g — tfrgi has 
the opposite sign of F(9) — F(9') is almost zero. 

Evidently, E(L) is generally determined by a complicated relationship involving bias, variance, co- 
variance, and higher moments. Natural MCO in general, and naive MCO in particular, ignore all of 
these effects, and consequently, often perform quite poorly in practice. In the next section we discuss 
some ways of addressing this problem. 

3.4 Parametric Machine Learning 

There are many versions of the basic MCO problem described in the previous section. Some of the 
best-explored arise in parametric density estimation and parametric supervised learning, which together 
comprise the field of Parametric machine Learning (PL). In particular, parametric supervised learning 
attempts to solve 



Here, the values x represent inputs, and the values y represent corresponding outputs, generated accord- 
ing to some stochastic process defined by a set of conditional distributions {p(y | x), x e X}. Typically, 
one tries to solve this problem by casting it as a single-stage MCO problem, For instance, say we adopt 
a quadratic loss between a predictor z$(x) and the true value of y. Using MCO notation, we can express 
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the associated supervised learning problem as finding arg mine F{9), where 

l e {x) = J dy p(y | x) (z e (x) - y) 2 , 
fe{x) 
F{9) 



p(x) le(x), 
dx fe(x). 



(11) 

Next, the argmin is estimated by minimizing a sample-based estimate of the _F(#)'s. More precisely, 
we are given a 'training set' of samples of p(y \ x)p(x), {(x^ l \y l )i — l,...,m}. This training set provides 
a set of associated estimates of F(8): 

F(9)=-J2le(x^). 
m * — ' 

»=1 

These are used to estimate argming F((9), exactly as in MCO. In particular, one could estimate the 
minimizer of F(9) by finding the minimium of F{9), just as in natural MCO. As mentioned above, this 
MCO algorithm can perform very poorly in practice. In PL, this poor performance is called 'overfitting 
the data'. 

There are several formal approaches that have been explored in PL to try to address this 'overfitting 
the data'. Interestingly, none are based on direct consideration of the random variable and the 



ramifications of its distribution for expected loss (cf. Eq. 10). In particular, no work has applied the 
mathematics of extrema of multiple random variables to analyze the bias-variance-covariance tradeoffs 
encapsulated in Eq. [10] 

The PL approach that perhaps comes closest to such direct consideration of the distribution of 
F(9*) is uniform convergence theory, which is a central part of Computational Learning Theory [see 



Angluin 1992 . Uniform convergence theory starts by crudely encapsulating the quadratic loss formula 



for expected loss under natural MCO, Eq. [10] It does this by considering the worst-case bound, over 
possible p(x) and p(y | x), of the probability that F(0*) exceeds ming F(9) by more than k. It then 
examines how that bound varies with k. In particular, it relates such variation to characteristics of the 



set of functions {fg : 9 € ©}, e.g., the 'VC dimension' of that set [see Vapnik 1982 1995 



Another approach is to apply bias-plus- variance considerations to the entire PL algorithm 0*, rather 
than to each <&e separately. This approach is applicable for algorithms that do not use natural MCO, 
and even for non-parametric supervised learning. As formulated for parameteric supervised learning, 
this approach combines the formulas in Eq. [TT] to write 



F{9) 



dx dy p{x)p{y \ x){z e {x) - y) 2 



This is then substituted into Eq. [9] giving 

E(L) - [ d9*dxdyp(x)p(y\x)p(9*)(z § 4x)-y) 2 



dx p{x) 



d9*dyp(y\x)p(9*)(z^(x)^y) 1 



(12) 



The term in square brackets is an x-parameterized expected quadratic loss, which can be decomposed 
into a bias, variance, etc., in the usual way. This formulation eliminates any direct concern for issues like 
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the distribution of extrema of multiple random variables, covariances between and <&g' for different 
values of 6, and so on. In some ways, this imposes stronger conditions on the estimators we design, 
and consideration of the issues outlined above may give weaker conditions on how to increase estimation 
accuracy. 

There are numerous other approaches for addressing the problems of natural MCO that have been 



explored in PL. Particulary important among these are Bayesian approaches, e.g., Buntine and Weigend 



1991 , Berger 1985 , Mackay 2003 . Based on these approaches, as well as on intuition, many powerful 
techniques for addressing data-overfitting have been explored in PL, including regularization, cross- 
validation, stacking, bagging, etc. Essentially all of these techniques can be applied to any MCO 
problem, not just PL problems. Since many of these techniques can be justified using Eq. |12[ they 
provide a way to exploit the bias-variance tradeoff in other domains besides PL. 

4 PLMCO-CE 

In this section, we first present a review of the CE method for continuous problems, and describe various 
components in the terms used thus far. 



Kroese et al. 2006 describe the application of the CE method to continuous multi-extremal problems. 
They describe the use parametrized continuous distributions, in particular, multivariate Gaussian and 
multivariate Gaussian mixtures, to perform adaptive importance sampling for optimization of continuous 
functions. The problem at hand, in our notation, is as follows. Find 

minimize-rg^ G(x). (13) 

This is then converted to an Associated Stochastic Problem (ASP) over (9-parametrized probability 
distributions qg over X. Let 7* = min xe x[G(x)], and let /{.} be the indicator function. We want to 

maximize*? E, s /{g(x)< 7 *} (14) 

This solution to this problem is a degenerate Dirac-delta function centered on the optimal x, provided 
that the set of allowable 9 permits parametrization of such degenerate distributions. Note also that 
sampling this degenerate distribution gives the answer to the original problem, so in a way, the two 
problems are equivalent. 

The ASP is actually solved using a homotopy method, where one solves a sequence of optimization 
problems with progressively decreasing value of 7. The algorithm proceeds as follows. 8 is initialized to 
some 8q ■ At each iteration t > 0, a set of samples is drawn from qg t , and Jt+i is chosen to correspond to 
the the best n percentile of these samples, and and 8 t+ i is chosen so as to minimize a KL divergence (or 
cross entropy) from the suitably normalized indicator distribution (also called a Heaviside distribution) 
given by 

p 7 (x) oc e 7 (x) = i G ^~. 7 ' (15) 

I 0, otherwise. 

The problem of computing 8 t +i is actually an MCO problem: we use a set of samples to search for 
the 8 that minimizes a parametrized integral 

8t +1 = argminKL(p 7 ||<7g) = — argmin / dxp^(x) In (qg(x)) . (16) 
'' 6 J , x 
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If we keep track of the qg that each of the elite samples was generated from, we can then properly use 
importance sampling to estimate the above integral. Also, the normalization constant turns out to be 
irrelevant, enabling the use of 7 rather than p~ r The importance-sampled estimate of of interest is 
therefore 



Here, qg k , is the actual parametrized distribution that generated the i th sample. In the CE method, 
however, the denominator of the likelihood ratio is for some reason ignored, and the importance ratio is 
just set to unity for the elite samples, and zero for all the others. 

4.1 Cross-validation for the CE Method 

Using the notation from Sec. [3] we see that the CE method actually performs naive MCO: it uses 
importance sampling to generate a finite-sample estimate of a parametrized integral, then estimates the 
integral-optimizing parameters by simply minimizing the finite-sample sum. It is known from extensive 
experience with learning algorithms in the PL community that such an approach is bound to perform 
poorly with small sample sizes. In other words, it will suffer from large errors caused by large variance, 



stemming from overfitting the data. Indeed, to prevent such overfitting, Kroese et al. 2006 recommend 
the use of 'dynamic smoothing' to prevent 'premature shrinking' of the distributions. Also, the percentile 
k must be set appropriately: if it is too large, convergence to an optimum will be very slow, and if it is 
too small, the algorithm will converge prematurely, either to a local optimum, or worse, to a point that 
is not even a local minimum, and make extremely slow progress towards a minimum. 

The percentile n, or equivalently, the elite size 7V chtc , can be thought of as a hyperparameter: a 
data-independent parameter that affects algorithm performance. In a PL context, hyperparameters are 
often set using the technique of cross-validation, which works as follows: the given data is partitioned 
into two parts, a training set and a 'held-out' test set. The learning algorithm (also called a training 
algorithm) is given the training data, and the estimated optimal parameters 9* it generates are tested 
by estimating the value of the parametrized integral using the test data. The hyperparameters are then 
chosen so as to optimize this 'held-out' performance. This can be done many ways, helpfully named 
70-30 cross-validation, fc-fold cross-validation, or leave-one-out cross-validation. As expected, in 70-30 
cross-validation, the given data is randomly partitioned into two parts containing 70% and 30% of the 
data. The larger partition is used for training, and the smaller one for testing. In fc-fold cross-validation, 
the data is divided into k equal-sized partitions, k — 1 of these are used for training, and the last one 
is used for testing. This procedure is repeated k times, so that there have been k estimates generated 
and tested; the hyperparameters are chosen to optimize average held-out performance. In lcave-one-out 
cross-validation, all but one data point are used for training, and the left-out point is used for testing. 
Of course, leave-one-out cross-validation only works if the performance measure makes sense on a single 
point. In some cases, such as prediction error, it does. In this paper, we use 4-fold cross-validation to 
trade bias and variance of the CE 'training algorithm' to dynamically pick the percentile value n t for 
elite samples. For our experiments, we used a fixed set of parameters to implement dynamic smoothing, 
though these too could be chosen using cross-validation. 
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In the case of mixture distributions, Kroese et al. 2006 use a technique called augmentation, where 
each elite sample is presumed to have come from a particular component of the mixture. They don't 
specify how exactly they perform the partitioning of elite samples. Instead, we use the well-known 
Expectation Maximization (EM) algorithm that is widely used for precisely this purpose: maximizing 
likelihood (or cross-entropy) using mixture distributions. Broadly speaking, instead of deterministically 
partitioning the elite samples, the EM algorithm iteratively 'soft-partitions' the elite samples using 
probabilities over a latent variable to specify which mixture component they arose from. The number 
of mixture components specifies what's called a model class. In addition to picking hyperparameters, 
cross-validation is also used to select model classes, and this process is called model selection. We use 
cross-validation to adaptively pick the number of mixing components in the Gaussian mixture qe t . 

In our case, we want to use cross-validation to pick 7, or equivalently, k. Therefore, rather than 
use E I?e /{G( 2 ,)< 7 } as the parametrized integral to optimize, we choose E ge [G(x)]. Even though we use a 
different ASP, note that the resulting optimization problem is formally equivalent to the CE ASP: they 
share the same optimum point 9, but differ in optimum value. We do, however, still use the CE method 
as our training algorithm. In other words, we use varying k in the CE algorithm to generate estimates 9* 
of the optimal 9, and use cross-validation to choose that value re that minimizes the importance-sampled 
estimate of E qe [G(x)] on the held-out data. In other words, pick k to minimize 



G(x^). (18) 



Note that the sum is over the held out samples, denoted by the subscript ho on the summation limit. 
Also note that we do keep track of the actual likelihood for each sample generated, and use the correct 
likelihood ratio in our importance-sampled estimate for held-out performance. 

We call this CE algorithm, where hyperparameters and models are adaptively picked using cross- 
validation, PLMCO-CE, since PL techniques are applied to improve MCO performance of the CE algo- 
rithm. 



5 Results and Discussion 

In this section, we present performance comparisons between the conventional CE method and the new 
version presented in Sec. [4] on a few simple test problems for unconstrained optimization. 

5.1 Test Problems 

The test problems are all analytic, low-dimensional, multi-extremal functions. The problems varied in 
dimensionality from 4 to 8, and each of them has properties that make it difficult for local optimizers 
to find the global optima: some, such as the Woods and Rosenbrock problems, are badly-scaled, others 
have multiple local minima, with the worst minima having the largest basin of attraction for gradient- 
based algorithms, and so on. A few of these problems have been used for testing continuous multi- 



extremal optimization using the CE method Kroese et al. 2006 . The 4-dimcnsional problems are the 



n-dimensional (extended) Rosenbrock, the Woods function, and the Shekel family of functions. The 
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Hougcn function is 5-dimcnsional. Wc also experimented with the 6-dimcnsional Hartman function and 
the 8-dimensional Rosenbrock. The n-dimensional extended Rosenbrock is defined as 



G(x) = [(! - x i) 2 + 10 °(^ 2 " ^+i) 2 ] 
»=i 

The Woods function is defined as 

G(x) = 100(x 2 -x 1 ) 2 + (l-x 1 ) 2 + 90(x A -x 2 3 ) 2 + {l-x 3 ) 2 
+10.1[(1 - x 2 ) 2 + (1 - Xi) 2 ] + 19.8(1 - x 2 )(l - Xi ), 



(19) 



(20) 



The Hougen function is described by Kroese et al. [2006] . The others are described in detail by Dixon 



and Szego 1978 



5.2 Algorithms Tested 



We versions of CE algorithm described by Kroese et al. 2006 , by varying value of the elite size N ellte 
and the number of components in the Gaussian mixture. We tested three values of N ehte as a fraction of 
the number of samples taken: 5, 10, and 15%. For each of these, we experimented with a single Gaussian 
and a 3-componcnt Gaussian mixture. In the attached plots, the single Gaussian algorithms are denoted 
by CESxx, where xx represents N chtc as a percentage of the population size, and the mixture-based 
algorithms are denoted by CEMxx. We also tested two PLMCO-CE algorithms, which dynamically 
chose a good value of N ellte at each iteration. These algorithms used cross-validation to pick the value 
of jV ellte that optimized estimated held-out performance, i.e., Eqg[G(x)], as described in the preceding 
section. The version using Gaussian mixtures also dynamically chose the number of mixing components 
(between 1 and 3) using cross-validation. For a given number of mixing components, the algorithm used 
cross-validation to find the optimal value of jV clltc . This follows the rather standard machine-learning 
practice of optimizing hyperparameters for each model type using cross-validation, and then picking 
the model type that optimizes average held-out performance. The PLMCO-CE algorithms using single 
Gaussian and mixtures are denoted by CESX and CEMX respectively. 



5.3 Comparing Algorithm Performance 

We compare the performance of these algorithms by running each of them 100 times on all 8 problems. 
For a given trial of a test problem, the same set of initial samples was used for all algorithms. There- 
fore, performance differences between algorithms reflect directly on the choices made by the algorithms 
themselves. Of course, this common initial set was varied from trial to trial. 

We used two metrics to analyze the results of these experiments. The first is mean performance, 
but since we do not have the true mean, we plot the sample mean of these 100 runs, accompanied by 
the associated 95% confidence intervals shown as shaded regions. This, however, is not a comprehensive 
summary of algorithm performance: we also present a semilog plot of algorithm performance, where 
we plot (Gbest — G*) on a log scale against the number of function calls. This is standard practice in 
conventional optimization, but is seldom done in evolutionary optimization community. One reason to 
use a semilog scale is to be able to visually distinguish smaller values when displayed on the same plot as 
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larger ones: often, we want to know if some algorithm consistently finds numbers 0.001 smaller than some 
other algorithm, but the numbers themselves arc much larger than 0.001. In conventional optimization, 
the other major reason to use a semilog scale is to assess convergence rate. On these semilog plots, we 
cannot properly display confidence intervals for two reasons: the upper and lower confidence intervals 
will not be of the same size owing the logarithmic nature of the plot scale, and besides, the value of the 
mean minus the confidence interval may be negative, and this cannot be shown on a log-scale. Owing to 
the positivity of the variable (Gbcst — G*), this random variable has a rather skewed distribution, and 
in such cases, it is often more informative to consider the median, since the mean is dominated by the 
large values. So, our second metric is median performance, and this is shown plotted on a semilog scale, 
along with best and worst performance. Of course, these extremal events are those that occurred on our 
100 runs, and we cannot make rigorous claims regarding the true best-case or worst-case performance. 

5.4 Results 

We now present the performance comparisons discussed above. Fig. [T] summarizes performance on the 
6-dimcnsional Hartman problem. As we can see, the plot of mean performance does not show any 
discernible difference, but this is part of the reason for showing the semilog plot. We see that the median 
performance of the PLMCO-CE algorithm is many orders of magnitude better, both using Gaussian 
mixtures and a single Gaussian. This is not surprising on a multimodal problem, where 'overfitting' to 
a set of unlucky data would lead to convergence to a local optimum. 

Fig. [2] shows performance on the 4-dimensional Woods problem, which is unimodal, but is poorly 
scaled and has large flat regions. Using a single Gaussian, the mean performance of PLMCO-CE shows 
significant early gains, and yet final performance is no worse than the standard CE algorithm. This just 
means that PLMCO-CE quickly converges to the optimum. With mixtures, median performance, as 
usual, is greatly improved. 

Fig.[3]compares results on the Shekel family of functions, each of whose members is 4-dimensional and 
multimodal. As before, median performance with mixtures is vastly improved with PLMCO-CE, and 
is not adversely affected with a single Gaussian. The large variances in the mean performance indicate 
that both PLMCO-CE and standard CE with mixtures are getting trapped in local minima. This is to 
be expected while using mixtures on a multimodal problem. Nevertheless, PLMCO seems to ameliorate 
this problem. While it is clear that performance is greatly improved in the initial stages, we cannot 
be as certain about final mean performance. Still, the overlap in confidence intervals is rather small, 
and best-case performance is vastly improved, suggesting that using PLMCO advantageously skews the 
distribution of algorithm performance. With a single Gaussian, we can see that we need more trials: 
the variance in the means is not small enough to draw valid conclusions about final performance. In all 
cases, significant early gains from using PLMCO are indisputable. 

Figs. [4] and [5] compare performance on the Hougen and Rosenbrock functions. On the Hougen 
problem, as before, PLMCO provides significant gains in the early stages, but no significant final gains 
(all variants do eventually converge to the optimum). On the n-dimensional Rosenbrock, all algorithms 
perform very poorly, mainly owing to bad scaling (of curvature) of the problem. Even so, PLMCO does 
not seem to hurt the performance in any way, and arguably improves it slightly. While the 10-dimensional 
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Hartmanfe Function, 100 trials 



Hartman6 Function, 100 trials 
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d. Mean performance, single Gaussian. 



Figure 1: Performance comparison on Hartman6 function. 
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Figure 2: Performance comparison on Woods function. 
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Shekels Function, 100 trials 



ShekeI7 Function, 100 trials 



ShekellO Function, 100 trials 






Median performance, Gaussian mixtures. 



Shekels Function, 100 trials 



ShekeI7 Function, 100 trials 
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Mean performance, Gaussian mixtures. 



Shekels Function, 100 trials 



Shekel7 Function, 100 trials 





ShekellO Function, 100 trials 
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Figure 3: Performance comparison on Shekel family of functions. 
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Hougen Function, 100 trials 
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c. Median performance, single Gaussian. 
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Figure 4: Performance comparison on Hougen function. 
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Rosenbrock4 Function, 100 trials 



Rosen brock 4 Function, 100 trials 





a. Median performance, Gaussian mixtures, b. Mean performance, Gaussian mixtures. 



Rosenbrock4 Function, 100 trials 



Rosenbrock4 Function, 100 trials 





c. Median performance, single Gaussian. d. Mean performance, single Gaussian. 



RosenbrockS Function, 100 trials 



RosenbrockS Function, 100 trials 
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Median performance, single Gaussian. h. Mean performance, single Gaussian. 
Figure 5: Performance comparison on n-dimcnsional Rosenbrock. 
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Rosenbrock seems to have been successfully used by Kroese et al. 2006 , careful examination reveals 
that the 10 5 function evaluations were used to arrive within 0.02 of the optimum, which is somewhat 
inefficient for a ten-dimensional problem. 

5.5 Discussion and Conclusions 

In truth, this is not the whole picture. The relation between MCO and PL can be summarized as follows: 
each step of an MCO algorithm is essentially an entire PL problem. In addition, between iterations, 
MCO algorithms acquire more data by sampling, and this choice of samples is largely a heuristic such 



as adaptive importance sampling. In the PL community, the field of active learning Freund et al. 1997 



Dasgupta and Kalai 2005| comes close to addressing this problem, but this still does not address the 



iterative nature of the MCO algorithm. All of this is in addition to the three points mentioned in the 
text: while designing estimators for parametrized integrals, there are few or no published results in the 
literature vis-a-vis tailoring such estimators for a search process. In particular, as described in the text, 
the authors are unaware of any analyses that consider the higher moments of random variables required 
to analyze extremal values. The same is the case with the consideration of moments coupling different 
estimators. Even when all these considerations are ignored, our results seem to strongly imply that 
PL techniques improve MCO performance. That is, a bias-variance tradeoff at each iteration of the 
algorithm, for the individual 8-by-8 estimators alone, even though it does not fully capture the process, 
does improve MCO performance. 

In the particular case of the CE method, the use of PLMCO-CE strongly improves mean performance 
in the early stages of the algorithm. Moreover, these early gains are achieved without degrading final 
performance. This seems to bear out the main point of our PLMCO hypothesis: a straightforward 
application of well-known PL techniques to the individual iterations of an iterated MCO process should 
result in improved performance. Even if one were to examine the performance more closely, such as our 
investigation of median performance using a semilog plot, there are often significant gains from using 
PLMCO. These gains often occur when the problems are multimodal, or the data are sparse, and in 
general, when there is a real possibility of 'overfitting' due to lack of data. 

One would expect that the phenomenon of having to choose between models and tune hyperparame- 
ters is the norm rather than the exception, that hard optimization problems will probably be multimodal, 
requiring the use of mixture distributions (and perhaps even more hyperparameters or model classes), 
and using naive MCO in such cases would indeed result in much overfitting, and consequently, erratic 
performance. Our version of PLMCO is extremely rudimentary: we ignore almost all the complexities 
of an iterative search algorithm, and yet these results seem to indicate that there are significant gains 
to be expected. An investigation of the more subtle nuances of such iterative search process is likely to 
lead to even more significant gains. The authors believe that several interesting insights await discovery. 
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